Fix CPUAdam same-step subgroup drift in ZeRO-3 (#7819) #7859

Closed
tohtana wants to merge 2 commits into deepspeedai:master from tohtana:tohtana/fix-issue7819-zero3-cpuadam-bias-correction

tohtana (Collaborator) commented Feb 18, 2026

Fixes #7819

The root cause was non-idempotent CPUAdam step-state handling under ZeRO-3 subgroup updates: repeated calls at the same logical step could take different internal paths and produce slightly different bias-correction metadata.

The fix makes same-step calls a no-op while preserving fast sequential updates, and adds regression tests covering both step_subgroup() and step() subgroup-style paths.
Validated with focused CPUAdam tests.
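
The step-transition behavior described above can be sketched in Python. This is a simplified model, not the DeepSpeed source: the `StepTracker` class, its fields, and the returned path names are hypothetical and exist only to illustrate the three cases (same step, next sequential step, anything else).

```python
import math


class StepTracker:
    """Hypothetical stand-in for the native optimizer's step and bias-correction state."""

    def __init__(self, beta1: float = 0.9, beta2: float = 0.999) -> None:
        self.beta1 = beta1
        self.beta2 = beta2
        self.step = 0
        self.beta1_t = 1.0  # running value of beta1 ** step
        self.beta2_t = 1.0  # running value of beta2 ** step

    def increment_step(self, step: int, beta1: float, beta2: float) -> str:
        """Move to `step` and return the name of the path taken."""
        betas_changed = (beta1 != self.beta1) or (beta2 != self.beta2)

        # Same logical step, same betas: leave all state untouched.
        if not betas_changed and step == self.step:
            return "noop"

        # Next sequential step: cheap incremental update.
        if not betas_changed and step == self.step + 1:
            self.beta1_t *= beta1
            self.beta2_t *= beta2
            self.step = step
            return "multiply"

        # Betas changed or the step jumped: recompute from scratch.
        self.beta1, self.beta2 = beta1, beta2
        self.beta1_t = math.pow(beta1, step)
        self.beta2_t = math.pow(beta2, step)
        self.step = step
        return "recompute"


tracker = StepTracker()
# Three subgroup-style calls for one logical step.
paths = [tracker.increment_step(1, 0.9, 0.999) for _ in range(3)]
```

In this model only the first call at a given step changes state, so every subgroup that follows sees the same `beta1_t` and `beta2_t`, regardless of how many times the shared object is invoked.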

Make Adam_Optimizer::IncrementStep idempotent for repeated calls at the
same logical step. ZeRO-3/SuperOffload can invoke multiple subgroup updates
for one step on a shared native optimizer object; the previous logic mixed
multiply and recompute paths, producing non-bit-identical bias-correction
metadata between subgroup calls.

This updates both CPU and XPU headers with aligned step-transition logic and
clarifies first-step/non-sequential-step behavior. It also adds CPUAdam
regression tests for subgroup-style repeated same-step updates via both
step_subgroup() and step() param swapping.

Signed-off-by: Masahiro Tanaka <mtanaka@anyscale.com>

PKUWZP (Collaborator) left a comment

@tohtana Very well-written fix. A few optional follow-ups to consider:

  • Adding a test with non-sequential steps (e.g., jump from step 2 to step 5) to validate the pow fallback path after the refactor
  • Adding a test with beta changes mid-training to exercise the if-branch + fallthrough path

tohtana (Collaborator, Author) commented Feb 18, 2026

Sorry @PKUWZP,
@st-bang97 already addressed this issue in #7820. Let's close this one and focus on #7820.

tohtana closed this Feb 18, 2026
tohtana added a commit to tohtana/DeepSpeed that referenced this pull request Feb 18, 2026
Add two unit tests requested in PR deepspeedai#7859 review:
- non-sequential step jump coverage (2 -> 5) to exercise the fallback path
- beta change mid-training coverage to exercise beta-change recompute path

Both tests use deterministic paired-param assertions to ensure updates and
optimizer states remain bit-identical across param-key/subgroup-style calls.
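
A paired-param assertion of this kind might look roughly like the following sketch. It is not the test code from the commit: `TinyAdam`, `float_bits`, and `run_paired` are hypothetical names, the optimizer is a scalar pure-Python model rather than CPUAdam, and the schedule values are arbitrary. The idea is to feed two parameters identical gradients through separate calls at each step, including a 2 -> 5 step jump and a mid-run beta change, and then compare raw bits instead of using a tolerance.

```python
import math
import struct


def float_bits(x: float) -> bytes:
    """Raw IEEE-754 bytes of a float, for exact rather than approximate comparison."""
    return struct.pack("<d", x)


class TinyAdam:
    """Hypothetical scalar Adam: per-key moments, shared step metadata."""

    def __init__(self, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
        self.lr, self.beta1, self.beta2, self.eps = lr, beta1, beta2, eps
        self.step = 0
        self.beta1_t = 1.0
        self.beta2_t = 1.0
        self.state = {}  # key -> (exp_avg, exp_avg_sq)

    def _sync_step(self, step, beta1, beta2):
        changed = (beta1, beta2) != (self.beta1, self.beta2)
        if not changed and step == self.step:
            return  # repeated call at the same step: no-op
        if not changed and step == self.step + 1:
            self.beta1_t *= beta1  # sequential step: incremental update
            self.beta2_t *= beta2
        else:
            self.beta1, self.beta2 = beta1, beta2  # jump or beta change
            self.beta1_t = math.pow(beta1, step)
            self.beta2_t = math.pow(beta2, step)
        self.step = step

    def update(self, key, param, grad, step, beta1):
        self._sync_step(step, beta1, self.beta2)
        m, v = self.state.get(key, (0.0, 0.0))
        m = beta1 * m + (1.0 - beta1) * grad
        v = self.beta2 * v + (1.0 - self.beta2) * grad * grad
        self.state[key] = (m, v)
        m_hat = m / (1.0 - self.beta1_t)  # bias-corrected first moment
        v_hat = v / (1.0 - self.beta2_t)  # bias-corrected second moment
        return param - self.lr * m_hat / (math.sqrt(v_hat) + self.eps)


def run_paired(schedule):
    """Apply identical gradients to two params via separate same-step calls."""
    opt = TinyAdam()
    a = b = 1.0
    for step, grad, beta1 in schedule:
        a = opt.update("a", a, grad, step, beta1)  # first subgroup-style call
        b = opt.update("b", b, grad, step, beta1)  # second call, same step
    return a, b, opt


# (step, grad, beta1): sequential, a 2 -> 5 jump, then a beta1 change.
schedule = [(1, 0.5, 0.9), (2, -0.25, 0.9), (5, 0.1, 0.9), (6, 0.3, 0.8)]
a, b, opt = run_paired(schedule)
assert float_bits(a) == float_bits(b)
assert opt.state["a"] == opt.state["b"]
```

Comparing packed bytes rather than using `math.isclose` is deliberate here: the regression being guarded against is a last-bit difference, which a tolerance-based check would hide.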

Signed-off-by: Masahiro Tanaka <mtanaka@anyscale.com>

Development

Successfully merging this pull request may close these issues.

[BUG] DeepSpeed ZeRO Stage-3 + CPU offloaded optimizer (CPUAdam) inconsistency metadata between subgroup

2 participants